
Conversation

CyprienBosserelle
Owner

Test Jules coding assistant in optimising using shared memory

CyprienBosserelle and others added 8 commits April 29, 2025 14:31
…to enhance their performance.

Here's a summary of the changes:

I've optimized six CUDA kernels:
- updateKurgXGPU
- updateKurgYGPU
- updateKurgXATMGPU
- updateKurgYATMGPU
- AddSlopeSourceXGPU
- AddSlopeSourceYGPU

The main idea was to stage frequently reused data in shared memory, a fast on-chip memory, instead of re-reading it from global memory. Here's how I approached it:
1.  I looked for parts of the code where a thread needs data from its neighbors.
2.  For these parts, I set up shared memory to hold the data temporarily (e.g., for `XEv.h`, `XEv.zs`, `XGrad.dhdx`, `Patm`, `XFlux.Fqux`, etc.).
3.  At the beginning of each kernel, I copied the necessary data from the main global memory into these shared memory areas, including any extra data needed for calculations involving neighbors.
4.  I used a synchronization command (`__syncthreads()`) to make sure all data was loaded into shared memory before any calculations began, and also before writing any updated shared data back to the main memory.
5.  I updated the calculations to use the data in shared memory instead of directly accessing global memory.
6.  Finally, where needed, I copied the results from shared memory back to global memory (for example, with the flux arrays in the `AddSlopeSource*` kernels).
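Applied to a simple stencil-style update, the steps above look roughly like the following sketch. The kernel name, the single array `h`, and the stencil itself are illustrative, not the actual BG_Flood code:

```cuda
#include <cuda_runtime.h>

#define STATIC_MAX_BLOCK_X 16
#define STATIC_MAX_BLOCK_Y 16
#define SHARED_MEM_HALO_WIDTH 1

// Illustrative stencil kernel: each thread reads its left and bottom
// neighbours, so the block's tile of `h` is staged in shared memory
// with a one-cell halo (steps 1-2).
template <class T>
__global__ void updateSketch(int nx, int ny, const T* __restrict__ h, T* out)
{
    __shared__ T h_s[STATIC_MAX_BLOCK_Y + 2 * SHARED_MEM_HALO_WIDTH]
                    [STATIC_MAX_BLOCK_X + 2 * SHARED_MEM_HALO_WIDTH];

    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    int tx = threadIdx.x + SHARED_MEM_HALO_WIDTH;
    int ty = threadIdx.y + SHARED_MEM_HALO_WIDTH;
    bool inside = (ix < nx && iy < ny);

    // Step 3: stage this thread's cell; edge threads also fetch the halo.
    if (inside)
    {
        h_s[ty][tx] = h[iy * nx + ix];
        if (threadIdx.x == 0 && ix > 0)
            h_s[ty][tx - 1] = h[iy * nx + ix - 1];
        if (threadIdx.y == 0 && iy > 0)
            h_s[ty - 1][tx] = h[(iy - 1) * nx + ix];
    }

    // Step 4: every thread reaches this barrier before any tile reads
    // (note it sits outside the `inside` guard, so it stays convergent).
    __syncthreads();

    if (inside)
    {
        // Step 5: neighbour reads now hit shared memory, not global memory.
        T left  = (ix > 0) ? h_s[ty][tx - 1] : h_s[ty][tx];
        T below = (iy > 0) ? h_s[ty - 1][tx] : h_s[ty][tx];

        // Step 6: write the result back to global memory.
        out[iy * nx + ix] = T(0.5) * (left + below);
    }
}
```

Templating on `T` mirrors how the same pattern serves both the single- and double-precision versions of the kernels.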

I used predefined constants (`STATIC_MAX_BLOCK_X = 16`, `STATIC_MAX_BLOCK_Y = 16`, `SHARED_MEM_HALO_WIDTH = 1`) to size the static shared-memory arrays. I also added guards so the kernels behave safely if the launch configuration doesn't match these compile-time tile sizes.
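Because the tiles are sized at compile time, the launch side has to use a matching block size. A minimal host-side sketch, using `updateKurgXGPU` from the list above with its arguments elided (the launch helper itself is an assumption, not the actual code):

```cuda
// The static shared-memory tiles are compiled for a fixed block size,
// so launch with exactly that size.
dim3 block(STATIC_MAX_BLOCK_X, STATIC_MAX_BLOCK_Y);
dim3 grid((nx + block.x - 1) / block.x,
          (ny + block.y - 1) / block.y);

// The kernels also guard internally: if blockDim exceeded the static
// tile size, the shared arrays would be indexed out of bounds, so the
// shared-memory path is skipped rather than overrunning the tile.
updateKurgXGPU<<<grid, block>>>(/* ... kernel arguments ... */);
```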

The logic for handling boundaries between blocks (for example, where neighbouring blocks sit at different refinement levels) still uses direct access to global memory.

These optimizations apply to both single and double-precision versions of the kernels. The goal is to reduce the time spent waiting for data from global memory, which should make the kernels run faster.
